GPU Computing in Julia

Biostat/Biomath M257

Author: Dr. Hua Zhou @ UCLA

Published: April 16, 2024

This lecture introduces GPU computing in Julia.

1 GPGPU

GPUs are ubiquitous in modern computers. The following table compares the NVIDIA GPUs found on typical computer systems today.

| NVIDIA GPUs         | H100 PCIe            | RTX 6000           | RTX 5000    |
|---------------------|----------------------|--------------------|-------------|
| Computers           | servers, cluster     | desktop            | laptop      |
| Main usage          | scientific computing | daily work, gaming | daily work  |
| Memory              | 80 GB                | 48 GB              | 16 GB       |
| Memory bandwidth    | 2 TB/sec             | 960 GB/sec         | 576 GB/sec  |
| Number of cores     | ???                  | ???                | ???         |
| Processor clock     | ??? GHz              | ??? GHz            | ??? GHz     |
| Peak DP performance | 26 TFLOPS            | ??? TFLOPS         | ??? TFLOPS  |
| Peak SP performance | 51 TFLOPS            | 91.1 TFLOPS        | 42.6 TFLOPS |

2 GPU architecture vs CPU architecture

  • GPUs contain 1000s of processing cores on a single card; several cards can fit in a desktop PC

  • Each core carries out the same operations in parallel on different input data – the single program, multiple data (SPMD) paradigm (see the kernel sketch at the end of this section)

  • Extremely high arithmetic throughput, provided one can transfer data onto the processors and results off of them quickly

[Figures: Intel i7 CPU die vs NVIDIA Fermi GPU die; analogy: the CPU is an Einstein (a few powerful cores), the GPU a Rain Man (many simple cores).]
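
To make the SPMD idea concrete, below is a minimal hand-written kernel sketch using the Metal.jl package introduced in the next section. The kernel name vadd! and the launch configuration are my own illustrative choices, not part of the lecture code. Every GPU thread executes the same function body on a different index:

using Metal

# SPMD: each thread runs this same body on its own index i
function vadd!(c, a, b)
    i = thread_position_in_grid_1d()  # this thread's global index
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = MtlArray(rand(Float32, 4096))
b = MtlArray(rand(Float32, 4096))
c = similar(a)
# launch enough groups of 256 threads to cover all elements
@metal threads=256 groups=cld(length(c), 256) vadd!(c, a, b)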

3 GPGPU in Julia

GPU support by Julia is under active development. Check JuliaGPU for currently available packages.

There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.

  • CUDA is an ecosystem exclusively for NVIDIA GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …

    The CUDA.jl package allows defining arrays on NVIDIA GPUs and overloads many common operations.

  • The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.

  • The Metal.jl package allows defining arrays on Apple Silicon and overloads many common operations.

  • The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
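
A nice consequence of this design is that generic array code is largely vendor-agnostic: only the array constructor differs across back ends. A minimal sketch, assuming Metal.jl (swap MtlArray for CuArray, ROCArray, or oneArray on other hardware):

using Metal                        # or CUDA, AMDGPU, oneAPI

A = MtlArray(rand(Float32, 4, 4))  # CuArray / ROCArray / oneArray elsewhere
B = exp.(A) .+ 2f0 .* A            # broadcasting compiles to a GPU kernel
sum(B)                             # reductions also run on the GPU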

I’ll illustrate using Metal.jl on my MacBook Pro running macOS Sonoma 14.4.1. It has an Apple M2 Max chip with a 38-core GPU.

versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2024spring/slides/09-juliagpu`
Status `~/Documents/github.com/ucla-biostat-257/2024spring/slides/09-juliagpu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [bdcacae8] LoopVectorization v0.12.169
  [dde4c033] Metal v1.1.0
  [37e2e46d] LinearAlgebra

4 Query GPU devices in the system

using Metal

Metal.versioninfo()
macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Max (4.000 GiB allocated)
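
Besides versioninfo, devices can also be queried programmatically. A minimal sketch (these calls are standard Metal.jl API, shown here as an aside since the lecture code only uses versioninfo):

# enumerate all Metal devices and show the current one
Metal.devices()   # vector of available MTLDevice handles
Metal.device()    # device on which new MtlArrays are allocated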

5 Transfer data between main memory and GPU

using Random
Random.seed!(257)

# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
3×3 MtlMatrix{Float32, Private}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
3×3 MtlMatrix{Float32, Private}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
# collect data from GPU to CPU
x = collect(xd)
3×3 Matrix{Float32}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135
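
A few other common transfer idioms, sketched here for completeness (the scratch array zd is introduced purely for illustration):

# allocate directly on the GPU
zd = Metal.zeros(Float32, 3, 3)
# in-place CPU -> GPU copy, reusing existing GPU memory
copyto!(zd, x)
# GPU -> CPU: Array() is an alternative to collect()
z = Array(zd)
# in-place GPU -> CPU copy
copyto!(z, zd)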

6 Linear algebra

using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)

n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);

6.1 Dot product

# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
BenchmarkTools.Trial: 693 samples with 1 evaluation.
 Range (minmax):  7.058 ms 10.672 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.163 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.217 ms ± 280.146 μs   GC (mean ± σ):  0.00% ± 0.00%
   █▆                                                        
  ▅██▄▃▃▃▂▂▂▂▂▁▂▂▂▂▁▁▁▁▁▁▂▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  7.06 ms         Histogram: frequency by time        9.14 ms <
 Memory estimate: 26.30 KiB, allocs estimate: 956.
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
BenchmarkTools.Trial: 145 samples with 1 evaluation.
 Range (minmax):  34.256 ms 35.923 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     34.705 ms                GC (median):    0.00%
 Time  (mean ± σ):   34.707 ms ± 293.524 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▃ ▂█          ▃▅▆ ▂ ▃▃  █▅       ▂                           
  ████▅█▅▄▁▄▁▄▇▄███████▅▅██▄█▄▁▇▅▅█▅▄▄▇▁▄▁▁▅▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▅ ▄
  34.3 ms         Histogram: frequency by time         35.6 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
4.844882645493413
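
As a reminder, dot applied to matrices computes the sum of elementwise products, i.e. tr(X'Y), without ever forming X'Y. A quick sanity check on small CPU arrays (illustrative only; LinearAlgebra is already loaded above):

a = rand(Float32, 3, 3)
b = rand(Float32, 3, 3)
dot(a, b) ≈ tr(a' * b)  # true, up to floating point rounding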

6.2 Broadcast

# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
BenchmarkTools.Trial: 553 samples with 1 evaluation.
 Range (minmax):  8.751 ms 11.202 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.894 ms                GC (median):    0.00%
 Time  (mean ± σ):   9.031 ms ± 286.161 μs   GC (mean ± σ):  0.00% ± 0.00%
   █▆▄▂▃▂                                                      
  ▆██████▅▄▄▃▂▁▃▂▂▄▄▅▅▆▅▅▅▆▇▅▄▅▃▃▄▃▂▁▂▁▃▁▁▁▁▁▁▁▁▁▁▁▁▂▃▂▁▂▂▂ ▃
  8.75 ms         Histogram: frequency by time        9.99 ms <
 Memory estimate: 5.34 KiB, allocs estimate: 197.
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
BenchmarkTools.Trial: 141 samples with 1 evaluation.
 Range (minmax):  35.070 ms 37.173 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     35.480 ms                GC (median):    0.00%
 Time  (mean ± σ):   35.499 ms ± 240.034 μs   GC (mean ± σ):  0.00% ± 0.00%
             ▂▇▇                                              
  ▃▄▃▃▃▃▇▃▄▆████▆▆▄▄▁▃▁▃▃▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  35.1 ms         Histogram: frequency by time         36.9 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
3.989440201602489
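
Note that Julia broadcasting fuses: a chain of dotted operations compiles to a single GPU kernel with no intermediate arrays. A minimal sketch (this expression is illustrative, not benchmarked in the lecture):

# the whole fused right-hand side launches as one kernel;
# no temporary GPU arrays are allocated
Metal.@sync zd .= xd .* yd .+ 1f0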

6.3 Matrix multiplication

# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (minmax):  902.868 ms  1.085 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     903.759 ms               GC (median):    0.00%
 Time  (mean ± σ):   939.934 ms ± 72.534 ms   GC (mean ± σ):  0.00% ± 0.00%
                                                               
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  903 ms          Histogram: frequency by time          1.08 s <
 Memory estimate: 1.12 KiB, allocs estimate: 62.

For this problem size on this machine, the GPU achieves a staggering ~9.7 TFLOPS throughput in single precision. (Multiplying two n × n matrices costs about 2n³ flops, which is the numerator in the throughput calculation below.)

# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
9.742394985320246e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.845 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
6.847891934440028e11

We see a >10x speedup (about 14x here) from the GPU in this matrix multiplication example.

# cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
# xtx = collect(xtxd)
# @benchmark cholesky($(Symmetric(xtx)))

GPU-accelerated Cholesky factorization appears to be unavailable through Metal.jl at the moment.

7 Evaluation of elementary and special functions on GPU

# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
BenchmarkTools.Trial: 508 samples with 1 evaluation.
 Range (minmax):  9.283 ms 11.832 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.779 ms                GC (median):    0.00%
 Time  (mean ± σ):   9.821 ms ± 392.345 μs   GC (mean ± σ):  0.00% ± 0.00%
     ▄▁         ▅▅▆█▇▅                                       
  ▄▇▇██▄▃▃▂▃▃▃▃▇██████▆▅▃▃▃▁▁▁▁▁▂▂▂▂▁▁▁▂▁▁▁▁▁▂▁▁▃▄▂▃▄▄▃▄▃▃▃ ▄
  9.28 ms         Histogram: frequency by time        10.9 ms <
 Memory estimate: 5.34 KiB, allocs estimate: 197.
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (minmax):  2.457 s 2.471 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.463 s              GC (median):    0.00%
 Time  (mean ± σ):   2.464 s ± 7.091 ms   GC (mean ± σ):  0.00% ± 0.00%
   ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.46 s        Histogram: frequency by time        2.47 s <
 Memory estimate: 0 bytes, allocs estimate: 0.
# Speed up
median(bm_cpu.times) / median(bm_gpu.times)
251.8307707741242

The GPU brings a dramatic speedup (over 250x here) to the massive evaluation of elementary math functions.
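
For a stiffer CPU baseline, one could SIMD-vectorize the elementwise loop with LoopVectorization.jl, which is already in this project's environment. A hedged sketch (the function name logsin! is my own; results will vary by machine):

using LoopVectorization

# SIMD-vectorized CPU version of z .= log.(y .+ sin.(x))
function logsin!(z, x, y)
    @turbo for i in eachindex(z)
        z[i] = log(y[i] + sin(x[i]))
    end
    return z
end

@benchmark logsin!($z, $x, $y)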